"Performance Limits of the CLASSIE Circuit Simulation Program"

The performance of the new LSI simulator CLASSIE is evaluated on
several circuits with a few hundred to over one thousand
semiconductor devices. A more accurate run time prediction formula
has been found to be appropriate for circuit simulators. The design
decisions for optimal performance under the constraints of the
hardware (CRAY-1) are presented.

"Circuit Simulation on Vector Processors"

Vector computers have an increased potential for fast, accurate
simulation at the transistor level of Large-Scale-Integrated
Circuits. Design considerations for a new circuit simulator are
developed based on the specifics of the vector computer architecture
and of LSI circuits. The performance of the new LSI simulator,
CLASSIE, is evaluated on a CRAY-1 vector-computer for several
circuits with a few hundred to over a thousand semiconductor
devices. Comments are given concerning the performance limits and
relative hardware dependence.

"LSI Circuit Simulation on Attached Array Processors"

The simulation of Large-Scale-Integrated (LSI) circuits requires
very long run time on conventional circuit analysis programs such as
SPICE2 and supermini computers. A new simulator for LSI circuits,
CLASSIE, which takes advantage of circuit hierarchy and
repetitiveness, and array processors capable of high-speed
floating-point computation are a promising combination. The program
development software environment of the Floating Point Systems 164
is evaluated based on the experience gained with the conversion of
both SPICE2 and CLASSIE to the machine. The FPS-164 has been used as
an attached processor to a VAX 11/780 with the UNIX operating
system. The performance of the two simulation programs on the host
computer, the VAX, and the attached processor is compared. The
FPS-164 architecture and Fortran compiler are evaluated by means of
the speedup of CLASSIE compared to SPICE2 on the same processor.

"Data-Flow Based Behavioral-Level Simulation and Synthesis"

While a large number of powerful design verification tools have been
developed for IC design at the transistor and logic gate levels,
there are very few silicon-oriented tools for architectural design
and evaluation. As the number of gates which can be implemented on
a single chip grows, these tools are becoming increasingly
important. The FTL2 system described in this paper is an interactive
system for specifying concurrent digital systems and analyzing their
behavior. FTL2 differs from other behavior-level simulation systems
in that the input specification for a circuit is a concurrent
program. Specifications are incrementally compiled into augmented
data-flow graphs which are then interpreted by a software data-flow
machine. FTL2 includes special control structures for describing
concurrent behavior in a structured fashion, a number of
user-oriented input features, and an extensive macro facility. The
concept of non-sharable resources is used to determine
timing-dependent module access conflicts in a value independent
manner. Incomplete specifications can be emulated and can be
modified interactively. FTL2 has been implemented in LISP and is
currently operational. A companion FTL2-based synthesis system is
currently under development.
Grant AFOSR-81-0021 became effective October 1, 1981 for research studies
in [Project Title]. Although originally intended to support research activities
over a 12 month period, because of the availability of complementary support
from industrial sources, it was possible to extend the research activities through
July 31, 1983. In the following, the major activities associated with this research
grant are summarized with particular attention to the research publications
which have resulted.

For  this  research  effort,  it  is  a  pleasure  to  acknowledge  the  cooperation 
received  from  Professor  Donald  Calahan  of  the  University  of  Michigan,  Ann 
Arbor.  Professor  Calahan  was  instrumental  in  establishing  initial  contacts  with 
AFOSR  for  our  grant.  Over  the  years,  Calahan's  research  group  at  Michigan  and 
ours  have  worked  closely  together  in  the  area  of  the  effective  use  of  parallelism 
and  vector  computation.  Calahan's  research,  supported  by  AFOSR  80-0152,  has 
been  particularly  helpful  to  us  and  is  described  in  the  two  publications  cited 
below.* 

RESEARCH  PERSONNEL 

The research personnel associated with this grant include Professors
Newton and Pederson together with Dr. Andrei Vladimirescu. During the initial
period of the grant, Dr. Vladimirescu was completing his pre-doctoral studies,
which were jointly supported by Bell Laboratories. In addition, he received
extensive computer availability on a CRAY computer from United Information
Services. His pre-doctoral work included the development of the circuit
simulator program CLASSIE.**

During the second year of the grant period, Dr. Vladimirescu continued as a
post-doctoral scholar concentrating on the extension of his earlier research
work to the use of array processors. Also in the second year of the grant,
Professor Newton gave attention to the additional topic of behavioral
simulation and synthesis based on data-flow machines.

* D.A. Calahan, "Multilevel Vectorized Sparse Solution of LSI Circuits", Proc. ICCC '80.

* D.A. Calahan, "Decoupled Solution of Circuit Matrices on Pipelined Processes", Proc. ICCC '82, pp SS7- _

** A. Vladimirescu and D.O. Pederson, "Circuit Simulation on Vector Processors", Proc. ICCC '82.


RESEARCH  PUBLICATIONS 


Four publications have resulted from this research grant. Three of these
publications concern circuit simulation using vector computers or array
processors while the fourth concerns behavioral simulation on advanced
computers. In the following, abstracts of these papers are included.

"Performance Limits of the CLASSIE Circuit Simulation Program"

A. Vladimirescu and D.O. Pederson
Proc. ISCAS '82

Abstract:

The performance of the new LSI simulator CLASSIE is evaluated on several
circuits with a few hundred to over one thousand semiconductor devices. A
more accurate run time prediction formula has been found to be appropriate for
circuit simulators. The design decisions for optimal performance under the
constraints of the hardware (CRAY-1) are presented.

"Circuit  Simulation  on  Vector  Processors" 

A.  Vladimirescu  and  D.O.  Pederson 
Proc.  ICCC  ’82 

Abstract: 

Vector computers have an increased potential for fast, accurate simulation
at the transistor level of Large-Scale-Integrated Circuits. Design considerations
for a new circuit simulator are developed based on the specifics of the vector
computer architecture and of LSI circuits. The performance of the new LSI
simulator, CLASSIE, is evaluated on a CRAY-1 vector-computer for several
circuits with a few hundred to over a thousand semiconductor devices. Comments
are given concerning the performance limits and relative hardware dependence.

"LSI Circuit Simulation on Attached Array Processors"

Andrei Vladimirescu

Abstract:

The simulation of Large-Scale-Integrated (LSI) circuits requires very long
run time on conventional circuit analysis programs such as SPICE2 and
supermini computers. A new simulator for LSI circuits, CLASSIE, which takes
advantage of circuit hierarchy and repetitiveness, and array processors capable
of high-speed floating-point computation are a promising combination.

The program development software environment of the Floating Point
Systems 164 is evaluated based on the experience gained with the conversion of
both SPICE2 and CLASSIE to the machine. The FPS-164 has been used as an
attached processor to a VAX 11/780 with the UNIX operating system.

The performance of the two simulation programs on the host computer, the
VAX, and the attached processor is compared. The FPS-164 architecture and
Fortran compiler are evaluated by means of the speedup of CLASSIE compared
to SPICE2 on the same processor.

"Data-Flow Based Behavioral-Level Simulation and Synthesis"

J.T. Deutsch and A.R. Newton
Proc. ICCAD '83

Abstract:

While a large number of powerful design verification tools have been
developed for IC design at the transistor and logic gate levels, there are very few
silicon-oriented tools for architectural design and evaluation. As the number of
gates which can be implemented on a single chip grows, these tools are
becoming increasingly important.

The FTL2 system described in this paper is an interactive system for
specifying concurrent digital systems and analyzing their behavior. FTL2 differs
from other behavior-level simulation systems in that the input specification for a
circuit is a concurrent program. Specifications are incrementally compiled into
augmented data-flow graphs which are then interpreted by a software data-flow
machine.

FTL2 includes special control structures for describing concurrent behavior
in a structured fashion, a number of user-oriented input features, and an
extensive macro facility. The concept of non-sharable resources is used to
determine timing-dependent module access conflicts in a value independent
manner. Incomplete specifications can be emulated and can be modified
interactively.

FTL2 has been implemented in LISP and is currently operational. A
companion FTL2-based synthesis system is currently under development.
PERFORMANCE LIMITS OF THE CLASSIE
CIRCUIT SIMULATION PROGRAM


Andrei Vladimirescu and Donald O. Pederson

Department of Electrical Engineering and Computer Sciences,
Electronics Research Laboratory,
University of California, Berkeley, California 94720.


ABSTRACT:

The performance of the new LSI simulator CLASSIE is evaluated on
several circuits with a few hundred to over one thousand
semiconductor devices. A more accurate run time prediction formula
has been found to be appropriate for circuit simulators. The design
decisions for optimal performance under the constraints of the
hardware (CRAY-1) are presented.


1. Introduction

CLASSIE [1], a simulation program for large-scale-integrated
circuits, has been developed to narrow down the speed performance
gap between a circuit simulator and a timing simulator given the
SIMD (single instruction, multiple data) architecture of the host
computer, e.g., CRAY-1. In spite of the inherent increase in the
execution speed of SPICE2 when run on the CRAY-1, an additional
order of magnitude increase is needed for the efficient analysis of
LSI circuits (greater than one thousand devices).

By using vector capability the speed improvement in CLASSIE is due
to ordering operations which can be performed in parallel [2]. Thus,
subcircuits with identical topology and semiconductor devices
described by the same model are evaluated in the vector mode. In
addition the use of generated machine code for the matrix solution
practically reduces the contribution of that part of the simulation
to less than one fifth of the total analysis time, even for very
large circuits. Finally the data structure has been specifically
tailored to accommodate vector operations with minimal
gather/scatter overhead.

The development of CLASSIE has gone through several stages. A most
important program is a vectorized version of SPICE (SPICEV) which
contains vectorized device model routines [...] border-block
diagonal matrix solver of CLASSIE. Its speed performance is
approximately half of that of CLASSIE and a few times faster than
SPICE2 depending on circuit size.


[...] 1.4 of the circuit complexity 'N'. Circuits analyzed with
CLASSIE on a CRAY-1 contradict both of the above generalizations.

A more accurate time model is necessary in order to predict
the performance of a circuit simulator. Semiconductor devices are
represented in most circuit simulators, such as SPICE2 [2], [4],
[5], ADVICE [6], ASPEC [7], SLIC [8] and CLASSIE, by a variable
number of equations depending on whether parasitic terminal
resistances are specified or not. For this reason there is an
arbitrary relation between the number of devices and that of nodes as
brought out in the large circuit examples listed below.

Two parameters have been found to provide a rather accurate
characterization of the simulation time. The parameters refer to
the two major parts of the analysis: semiconductor-device-model
evaluation (Jacobian terms) and linear-equation solution. Parameter
$t_d$ is the time for one model evaluation in one iteration. The
second parameter is $t_e$, the time for solving one equation in one
iteration. The analysis time can then be estimated as

$T = n_{it} (n_d t_d + n_e t_e) + \mathrm{overhead}$   (1)

where $n$ represents the number of entities designated by the
subscript, i.e. $n_{it}$ is the iteration number, $n_d$ the number of
devices and $n_e$ the number of equations. From above it is obvious
that the speedup one can get in circuit simulation depends on how
much the two characteristic times can be reduced. The particular
circuit determines the relative weight of the two terms in
parentheses. The analysis time for a subcircuit oriented program
like CLASSIE can be expressed as:

$T = n_{it} [ n_{di} t_{di} + n_{ei} t_{ei} + n_s (n_{ds} t_{ds} + n_{es} t_{es}) ]$


where $n_s$ is the number of subcircuits and the second subscript
'i' stands for interconnection while 's' stands for subcircuit. As
shown previously, $t_e$ can be reduced greatly by code generation [1].
In order to reduce $t_d$ (and $t_e$) in the vector processor
environment it is obvious that the devices have to be grouped either
by subcircuit and/or by model and evaluated in parallel. The model
evaluation can be broken down into actual computation, e.g.,
equivalent conductances and charges, and parameter gathering and
individual admittance scattering into the circuit matrix. If the
devices are grouped by models only the computation part can be
vectorized whereas if grouped by subcircuits, both computational and
[...]
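The two run-time estimates can be sketched in a few lines of Python. This is only an illustration, assuming the flat model reads T = n_it (n_d t_d + n_e t_e) + overhead and the two-level model adds n_s identical subcircuits to the interconnection terms; the function names and the numbers in the usage section are hypothetical and are not measurements from the paper.

```python
def flat_analysis_time(n_it, n_d, t_d, n_e, t_e, overhead=0.0):
    """Flat model: n_it iterations, each evaluating n_d device models
    (t_d each) and solving n_e equations (t_e each), plus overhead."""
    return n_it * (n_d * t_d + n_e * t_e) + overhead


def subcircuit_analysis_time(n_it, n_di, t_di, n_ei, t_ei,
                             n_s, n_ds, t_ds, n_es, t_es):
    """Two-level model: interconnection-level work plus the work of
    n_s subcircuit instances, repeated for n_it iterations."""
    interconnection = n_di * t_di + n_ei * t_ei
    per_subcircuit = n_ds * t_ds + n_es * t_es
    return n_it * (interconnection + n_s * per_subcircuit)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formulas.
    t_flat = flat_analysis_time(n_it=100, n_d=1000, t_d=50e-6,
                                n_e=800, t_e=10e-6)
    print(f"flat estimate: {t_flat:.3f} s")
```

Halving either characteristic time in such a sketch shows directly which of the two terms dominates for a given circuit.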

The time savings in model evaluation are, however, not [...] for
several reasons. First, some routine vectorization in the CRAY-1
context increases the overhead in addition to the speedup. Device
evaluation relies on different analytical formulations depending
upon the operating region. The CRAY-1 Fortran compiler prohibits
branching when vectorizing and thus all the different characteristic
results must be computed for all devices and only in the end the
appropriate data are assigned to the solution vector depending on
each device.

Second, memory access can account for a major part of $t_d$. The
indefinite admittance matrix of each device must be scattered into
the circuit representation. In SPICEV the representation is the
overall circuit matrix while in CLASSIE it is the sub-circuit [...]
and interconnection matrices cannot be vectorized and it has been
measured with a CRAY-1 simulator that the scatter can take 72% of
the total time of the bipolar transistor evaluation when full 64
element vectors are used [9]. The scatter into the diagonal
submatrices corresponding to different instances of the same cell
definition is the only one that can be vectorized. In this case the
computation dominates over memory access.

Third, the speed in the computational part depends upon the vector
length. Grouping devices by subcircuits, e.g. all transistors M1 of
the different instances of subcircuit SUB1, cuts down the scatter
time since the results are stored in parallel in the diagonal
submatrices. However, the vector length is shorter because the
number of occurrences of a subcircuit is less than the number of
devices described by the same model. Experiments show that in
MOSFET evaluation only two thirds of the computational speed with
full vectors (64 elements for the CRAY-1) can be achieved for
vectors with only 10 elements.

Fourth, no a priori knowledge can be used to [...]
ABSTRACT:

Vector computers have an increased potential for fast, accurate
simulation at the transistor level of Large-Scale-Integrated
Circuits. Design considerations for a new circuit simulator are
developed based on the specifics of the vector computer
architecture and of LSI circuits. The performance of the new LSI
simulator, CLASSIE, is evaluated on a CRAY-1 vector-computer
for several circuits with from a few hundred to over one
thousand semiconductor devices. Comments are given concerning
the performance limits and relative hardware dependence.

1. Introduction

In recent years the need for detailed, electrical simulation
for large circuits has initiated the research for algorithms
which are faster for LSI circuits and preserve the
same accuracy as in standard circuit simulation. This
research has resulted in a number of prototype circuit and
circuit/timing simulators such as MACRO [1], SLATE [2],
and RELAX [3]. From an algorithmic point of view these
programs can be classified as third-generation simulators.
All these new programs use computers with conventional
architectures.

The advent of vector computers, e.g. the CRAY-1 and
the CYBER 205, which are capable of performing hundreds
of millions of floating-point operations per second (Mflops),
represents a second major factor that can be considered
in the development of a computationally involved simulator.
High execution rates for floating point arithmetic are
achieved with new architectures which perform each single
instruction on a multiple data stream (SIMD) and through
pipelining of instructions. The specifics of the SIMD
architecture and of LSI circuits have been the two major design
considerations in the development of another third-generation
prototype circuit simulator, CLASSIE [4], [5].

CLASSIE is also intended to reduce the speed performance
gap between a circuit simulator and a timing simulator.
It is the only third generation simulator to this point
which takes advantage of a parallel architecture of the
host computer.

In the development of CLASSIE, an additional program
has been used for a first characterization of the architecture.
SPICEV is a version of SPICE 2G which contains vectorized
device-model routines and a scalar machine-code
solver for the sparse linear equations. Its speed performance
is approximately half of that of CLASSIE and a few
times faster than SPICE2, depending on circuit size.

2. Vector Processors

The CRAY-1, a 160 Mflops machine, and the CYBER 205,
an 800 Mflops computer, have a number of common architectural
characteristics. Both processors have an instruction
buffer and decode unit, a scalar and a vector processing
unit. Both have a large number of arithmetic functional
units, 13 for the CRAY-1 and 11 (or 17) depending on
configuration for the CYBER 205. In most of these functional
units concurrent processes can take place. In
scalar or vector computation a floating-point operation is
partitioned into a number of segments and when an intermediate
result is ready, it can be chained directly to other
functional units. A resulting vector element is available at
each clock cycle on the CRAY-1 and two each cycle on the
CYBER 205.

Beyond these overall similarities there are specifics
for each processor. The most important differences
between the CRAY-1 and the CYBER 205 are the clock
cycle, 12.5 ns versus 20 ns, the instruction set, hardware
level versus microprogrammed instructions, and the
memory, 4 Mword, 64 bits per word, versus virtual
memory. Other differences include the number of processor
registers, 72 address, 72 scalar and 8 vector registers
of 64 elements for the CRAY-1 compared to 256 registers
overall for the CYBER 205. The concurrency of operations
can be higher on the CYBER because of the maximum of
four floating-point pipes which are part of the vector unit
and which can process an addition and a multiplication
each at the same time.

For application programs it is important to observe
the simultaneous use of add and multiply units, chaining of
operations from a functional unit to another and the average
vector length. The last parameter is an important
measure of the efficiency of a vector operation which is
associated with an important overhead called start-up
time. The average vector length for which the vector processor
reaches half the advertised speed is approximately
15 for the CRAY-1 and 50 for the CYBER 205. The former is
faster for short vectors whereas the latter for long vectors
(over one hundred elements).
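The effect of start-up time on short vectors can be illustrated with a simple half-performance-length model. This model is an assumption introduced here for illustration and is not taken from the paper: the delivered fraction of peak speed is taken as n / (n + n_half), where n_half is the vector length at which half the advertised speed is reached (about 15 for the CRAY-1 and 50 for the CYBER 205, as quoted above). The function name is hypothetical.

```python
def fraction_of_peak(vector_length, half_length):
    """Fraction of the advertised vector speed delivered at a given
    vector length, under the assumed model n / (n + n_half)."""
    if vector_length <= 0 or half_length <= 0:
        raise ValueError("lengths must be positive")
    return vector_length / (vector_length + half_length)


if __name__ == "__main__":
    # Half-performance lengths quoted in the text above.
    for machine, n_half in (("CRAY-1", 15), ("CYBER 205", 50)):
        for n in (10, 64, 200):
            frac = fraction_of_peak(n, n_half)
            print(f"{machine}: n={n:3d} -> {frac:.2f} of peak")
```

The sketch only captures the trend that longer vectors amortize the start-up cost; it does not model memory traffic or chaining.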

For a scientific application program the most
interesting performance measure is an equivalent Mflops
rate which incorporates the memory traffic present in any
algorithm implementation. An estimate of this number is
derived below for a circuit simulator and cannot be
expected to be larger than a fraction of the maximum processor
speed.

3. Simulation of LSI Circuits

A basic consideration in the design of the new simulator
is the object of the analysis. Only simple circuits had
to be analyzed when SPICE was designed over ten years
ago. These same circuits constitute today mere cells of an
LSI system. For the purpose of the simulation an LSI circuit
can be described usually as a collection of a limited
number of structurally different functional blocks such as
logic gates, operational amplifiers, etc., each block occurring
more than once at the system level. The decomposition
of the large circuit is the major source for speed
improvement in third-generation circuit simulators.

The partitioning of an LSI/VLSI circuit into a cell /
building block / system structure is useful information at
any level of simulation. The analysis in CLASSIE is done at
two levels, the system (building block) and the cell (subcircuit)
level. The new program groups the cells described by
the same definition together based on hierarchical tearing
(derived from the input description) and solves each group
in one pass through the same code.

CH1813-5/82/0000-0172 $00.75 © 1982 IEEE
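The grouping of cells by definition can be pictured with a short Python sketch. It is a minimal illustration under stated assumptions, not CLASSIE's actual data structure: the instance names, the (name, definition) tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def group_by_definition(instances):
    """Collect cell instances that share a definition so that each
    group can be processed in one pass through the same code.

    `instances` is an iterable of (instance_name, definition_name)
    pairs; the result maps each definition to its instance names in
    input order.
    """
    groups = defaultdict(list)
    for name, definition in instances:
        groups[definition].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical netlist fragment.
    cells = [("X1", "NAND"), ("X2", "OPAMP"), ("X3", "NAND")]
    print(group_by_definition(cells))
```

The length of each group is the vector length available when one device position is evaluated across all instances of that cell.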

An important guideline in the design for speed-up is
that the number of items which define a vector be maximum.
For this purpose a feature in the input language is
provided to pass parameters to the cells with identical
topology (described by the same definition). Another
necessary feature for the convergence of large circuits is
the ability to define initial conditions local to each cell
instance.

Two major parts of a circuit simulator can be singled
out as requiring most of the floating-point computation
and therefore of the run time. These are the evaluation of
the nonlinear characteristics of the semiconductor devices
and the solution of the resulting linear equations. The
impact of vectorization on these two major components is
presented in this section as well as in the section commenting
on results.

The speed-up of the semiconductor model evaluation
is essential since it usually accounts for more than half of
the total cpu time of MOSFET circuits even when these are
large. The use of generated machine code for the matrix
solution [5] practically reduces the contribution of this
part of the simulation to less than one fifth of the total
analysis time, even for very large circuits. The data structure
has been specifically tailored to accommodate vector
operations with minimal gather/scatter overhead.

Model evaluation can be partitioned in a number of
tasks. A model parameter gather is followed by device
initialization, terminal voltages initialization, equivalent
conductance computation and in the end by an indefinite
matrix terms scatter.

In CLASSIE as in SPICE2, semiconductor devices are
described by geometrical features which are individual for
each device, and general parameters, e.g. saturation
current of a pn junction or threshold voltage for a MOS
structure. The first task accomplished by the model routine
and called model parameter gather, is to obtain the
model parameters from memory. After each device is
linearized a scatter operation takes place during which the
device indefinite-admittance matrix is stored in the circuit
matrix. These two operations account for more than half
of the time spent in the model routines of SPICE2.
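The scatter step described above can be sketched as follows. This is a generic illustration of adding a device's indefinite-admittance matrix into a dense circuit matrix, assuming a plain list-of-lists representation; CLASSIE and SPICE2 use sparse storage, and the function name and node numbering here are hypothetical.

```python
def scatter_device(circuit, device_matrix, nodes):
    """Add a device's indefinite-admittance matrix into the circuit
    matrix. `nodes[k]` is the circuit row/column of device terminal k."""
    for i, row_node in enumerate(nodes):
        for j, col_node in enumerate(nodes):
            circuit[row_node][col_node] += device_matrix[i][j]
    return circuit


if __name__ == "__main__":
    # A linearized two-terminal element of conductance g between
    # circuit nodes 0 and 2 has the matrix [[g, -g], [-g, g]].
    g = 2.0
    circuit = [[0.0] * 3 for _ in range(3)]
    scatter_device(circuit, [[g, -g], [-g, g]], nodes=(0, 2))
    for row in circuit:
        print(row)
```

Because each device writes to different, data-dependent matrix positions, this step is dominated by indexed memory accesses rather than arithmetic.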

In the setup phase a device reordering takes place by
hierarchical level (cell or system) and by model. The
model parameters need thus be gathered only as many
times as there are model definitions. The importance of
the model parameter gather is made negligible in CLASSIE
based on the fact that more than one device uses the same
model parameters.

The initialization and scatter operations can be performed
using vectors only for subcircuits of the same
topology. For the interconnection circuitry these tasks
are performed sequentially, device by device. The
voltage-limiting and equivalent conductance computation
use vector operations on devices grouped mainly by
models.

An important issue is the definition of vectors
throughout the model evaluation. For the transistors at
the system level it is quite straightforward to define a vector
across all devices which reference the same model. As
has already been mentioned only the computation part is
vectorized for these elements.

The different possibilities to define vectors can be
presented best by the following example. Assume that a
circuit contains 12 instances of a subcircuit OPAMP which
in turn has 20 MOS transistors. The semiconductor devices
at the subcircuit level can define a vector across all
instances of that cell. Thus, the transistors named M09 in
all 12 instances of the subcircuit OPAMP are linearized in
one pass through the code. This results in 20 passes
through the model evaluation code with a vector length of
12 each time. Although all tasks can be vectorized in this
approach a longer vector can be used in the computation
phase where all transistors of the same model and for all
instances of the subcircuit can be grouped together. In
the initialization and scatter tasks the longer vector used
in computation is divided into a number of short vectors
which contain as many elements as cell instances. The
gain in this approach comes from avoiding the start-up
times of a larger number of vector operations with shorter
vector lengths. In the above example assume that 5 of the 20
transistors are depletion loads and are characterized by
the same model parameters and that the remaining 15 are
enhancement devices and are also described by a unique
model. For the 12 instances the execution of the computation
loop is reduced from 20 times to 4 times (once for the
60 depletion devices and three times for the 180 enhancement
devices) for a maximum vector length of 64.
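The pass counts in this example follow from simple arithmetic, sketched below in Python. The function name and dictionary layout are hypothetical; the numbers (12 instances, 5 depletion and 15 enhancement transistors per cell, maximum vector length 64) are those of the example.

```python
import math


def passes_by_model(devices_per_model, n_instances, max_vector_length=64):
    """Number of passes through the computation loop per model when
    all devices of one model, across all instances, share a vector."""
    return {
        model: math.ceil(count * n_instances / max_vector_length)
        for model, count in devices_per_model.items()
    }


if __name__ == "__main__":
    per_cell = {"depletion": 5, "enhancement": 15}
    passes = passes_by_model(per_cell, n_instances=12)
    # Grouping by device position instead needs one pass per
    # transistor in the cell, each with a vector length of 12.
    passes_by_position = sum(per_cell.values())
    print(passes, "total:", sum(passes.values()),
          "versus", passes_by_position)
```

With 60 depletion devices fitting in one vector of at most 64 elements and 180 enhancement devices needing three, the loop runs 4 times instead of 20.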

Another trade-off in the design can be between a
longer vector loop which also performs more computation
than necessary or a number of shorter loops to which the
execution is directed depending on analysis status flags.
Both approaches lead to almost similar speeds.

As a final comment, the convergence check of the
semiconductor devices is performed in the same manner
as for the node voltages. The terminal voltages and device
currents are compared in a vector loop and a vector with
ones for the diverging elements and zeroes for the converging
ones is set up. A fast vector accumulation library
routine is then used to obtain the result.
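The flag-and-accumulate convergence check can be sketched as follows. The tolerance test used here (relative plus absolute tolerance, in the style common to SPICE-like simulators) and its default values are assumptions for illustration; the paper does not state the exact criterion, and the function name is hypothetical.

```python
def count_diverging(new_values, old_values, reltol=1e-3, abstol=1e-6):
    """Build a vector of ones (diverging) and zeroes (converging) and
    accumulate it; a zero total means every element has converged."""
    flags = [
        1 if abs(new - old) > reltol * max(abs(new), abs(old)) + abstol
        else 0
        for new, old in zip(new_values, old_values)
    ]
    return sum(flags), flags


if __name__ == "__main__":
    total, flags = count_diverging([1.0, 2.0, 3.0], [1.0, 2.5, 3.0])
    print(total, flags)
```

Writing the test without branches, as a flag vector followed by a sum, keeps the whole check in a form that a vectorizing compiler can handle.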

4.  Results  on  the  CRAY-1 

A number of large circuits containing from a few hundred
to over one thousand devices or equations have been
analyzed. Two typical circuits are built of two cells, a
bipolar NAND gate and an MOS operational amplifier, together
with interconnection circuitry. The size of the circuit is
easily varied by changing the number of instances of the
different cells. In the case of the MOS filter of Table 1 the
circuitry at the system level (MOS switches and capacitors)
also increases with complexity. A statistical description of
the benchmarks is given in Table 1 from the point of view of
both a flat representation, as in SPICE2, and a two-level
analysis, as used by CLASSIE.

Table 1


The three adders are all-NAND circuits containing
approximately 60% bipolar transistors and 40% diodes. The
filter is an NMOS switched-capacitor lowpass filter
containing 10 lowpass sections with two operational
amplifiers per section and two antialiasing and
reconstruction circuits.

Four equations out of sixteen representing the NAND
subcircuit correspond to external nodes. The OPAMP circuit
has five external nodes. The external node contributions
are gathered in the interconnection matrix. It is
interesting to note that two cells of totally different
function and complexity, 8 devices for the NAND gate versus
25 for the OPAMP, have the same size matrix representation.

Table 1 also gives the percentage contributions of model
evaluation and linear equation solution to the simulation
of the respective circuits. The data written in parentheses
for SPICE2 are obtained from runs using scalar code
generation. The increase in relative importance of the
Fortran equation solution can be explained both by the
increase in search with increasing complexity and by the
reduction of the model evaluation as more devices are
bypassed.

The analysis time per iteration for a subcircuit-oriented
program such as CLASSIE can be expressed for only one
subcircuit type as [5]:

$$T = T_i + n_s T_s + \text{overhead} \qquad (1)$$

where $T_i$ and $T_s$ are the times for one iteration at the
interconnection and the subcircuit level, respectively, and
$n_s$ is the total number of subcircuit instances. $T_i$ and
$T_s$ are a function of two characteristic times for a circuit
simulator, $t_d$, the time for one model evaluation, and
$t_e$, the time for solving one equation:

$$T = n_d t_d + n_e t_e \qquad (2)$$

where $n_d$ and $n_e$ are the number of devices and of
equations, respectively, and a second subscript $i$ will
stand for interconnection while $s$ will stand for
subcircuit.
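The two-level timing model of Eqs. 1 and 2 can be written out directly. The sketch below is only illustrative: the function and argument names are assumptions, the time values are hypothetical and in arbitrary units, and only the NAND cell size (8 devices, 16 equations) is taken from the text.

```python
def level_time(n_devices, n_equations, t_model, t_equation):
    """Eq. 2: time for one iteration at one level (interconnection or
    subcircuit), from the per-device and per-equation times."""
    return n_devices * t_model + n_equations * t_equation


def iteration_time(t_interconnection, n_instances, t_subcircuit, overhead=0.0):
    """Eq. 1: total time per iteration for a single subcircuit type."""
    return t_interconnection + n_instances * t_subcircuit + overhead


# Hypothetical characteristic times; cell size as for the NAND gate.
t_sub = level_time(n_devices=8, n_equations=16, t_model=2.0, t_equation=1.0)
t_int = level_time(n_devices=4, n_equations=10, t_model=2.0, t_equation=1.0)
total = iteration_time(t_int, n_instances=12, t_subcircuit=t_sub, overhead=5.0)
```

Because the subcircuit term is multiplied by the number of instances, any reduction of the per-device or per-equation time at the subcircuit level dominates the total for highly repetitive circuits.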
Table 2 summarizes the two characteristic times $t_d$
and $t_e$ introduced earlier in this section, as well as the
time per iteration and the speedup factor between the
two programs. A first observation from the above data is
the very large effect of the scalar machine code generation,
which cuts the equation solution time by a factor of 8 to
12. Considering still the numbers referring to the equation
solution, it should be noticed that $t_e$ increases with the
number of equations for the Fortran solver whereas it
differs very little in the machine-code solver. The
difference in the machine-code solver can be explained on
the basis that another factor becomes dominant, i.e., the
number of floating-point operations. This means that
there are more floating-point operations per equation on
average for the Filter. The effect of this number seems to
be absorbed in other search and memory operations when
the Fortran solver is used.

$t_d$, the model evaluation time, is also a very important
number. The model vectorization is seen to bring about a
speedup of around 1.5 for the bipolar mix (diode and BJT)
and around a factor of 2 for the MOS circuit. The speedup
for this part is not larger because more computation is
performed to evaluate all possible formulations of the
equivalent conductances, which depend on the region of
operation. This approach replaces branching in the
vectorized computation by vector merge operations. Another
factor is that the parameter gather and conductance scatter
are not vectorized. The larger value of $t_d$ for smaller
circuits run on SPICE2 (see Table 2) is the result of less
bypass than for a large circuit.
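The idea of replacing branches by merge operations can be sketched as follows. This is an illustration only: the simple square-law MOSFET equations, the parameter names `beta` and `vth`, and the helper `merge` are assumptions chosen for brevity and are not the model code of SPICEV or CLASSIE.

```python
def merge(if_true, if_false, mask):
    """Element-wise selection between two vectors under a mask,
    in the spirit of a vector merge operation."""
    return [t if m else f for t, f, m in zip(if_true, if_false, mask)]


def drain_currents(vgs, vds, beta=1.0, vth=1.0):
    """Evaluate every region formula for every device, then merge.
    No per-device branch appears in the arithmetic itself."""
    vov = [g - vth for g in vgs]
    i_lin = [beta * (v - 0.5 * d) * d for v, d in zip(vov, vds)]
    i_sat = [0.5 * beta * v * v for v in vov]
    saturated = [d >= v for d, v in zip(vds, vov)]
    cutoff = [v <= 0.0 for v in vov]
    ids = merge(i_sat, i_lin, saturated)
    return merge([0.0] * len(ids), ids, cutoff)
```

Every device pays for both the linear and the saturation formula, which is the extra computation that limits the speedup noted in the text.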

The speedup factor is influenced by several elements,
such as the ratio of $t_d$ vs. $t_e$ and of $n_d$ vs. $n_e$.
The speedup is larger for the bipolar circuits because the
contribution of equation solution in all-Fortran SPICE2
(see Table 1) is much larger than for the MOS circuits. This
part is reduced more effectively by the code solver than
the model evaluation is by vector operations.
Table 3

The evaluation of CLASSIE is presented in Table 3. The
characteristic times of Eq. 2 are derived. As already
mentioned, the characteristic times differ between the
interconnection circuitry and the subcircuits because of
the gather/scatter which is part of $t_{dx}$, where $x$
stands for either $i$ for the interconnection or $s$ for the
subcircuits. The equation-solver characteristics are also
different, based on the use of vector code for the
subcircuits and scalar code for the interconnection.

The data in Table 3 should be viewed in connection
with the circuit statistics presented in Table 1. An
important specification is that the runs for this table
have had just one parasitic resistance in the BJT model and
none in the MOSFET model. The reduction in characteristic
times for the subcircuit can be seen to be larger with
increasing number of instances. From the analysis of the
results from SPICEV it is expected that the speedup is
larger for the adder circuits compared to the filters. A
first observation relates to $t_{ds}$ for MOSFETs, which is
reduced by another factor of almost 2 compared to the time
in SPICEV. The devices at the interconnection circuitry,
181 out of 758 MOSFETs, are still characterized by a
$t_{di}$ of 52 ms. The reduction in the $t_{ds}$ parameter
for the bipolar mix (diodes and transistors) is closer to
25 to 30%.

Two numbers characterize the sparse solver; one is
the $t_e$ parameter while the second is the Mflops rate.
These numbers are computed from the run statistics, which
provide information such as the time for the subcircuit
and interconnection solver, the total number of iterations,
the number of operations for each subcircuit matrix, etc.
The number of operations includes the add, subtract,
multiply and divide because on the CRAY-1 these times are
very close to each other; an addition/subtraction takes 6,
a multiplication 7 and a reciprocal approximation 14 clock
cycles. The Mflops rate is a better characteristic of the
solver than the $t_e$ parameter. The reason is that the
operation count provides the best measure of the
computational effort. This number proves to be stable for
different sparsity patterns and is therefore a good
characteristic of the sparse solver on the CRAY-1. For both
SPICEV and CLASSIE interconnection equations the scalar
solver performs at 5.3 Mflops.
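The Mflops rate follows directly from the operation count and the solver time reported in the run statistics. The sketch below assumes hypothetical names and a hypothetical operation count; only the definition (floating-point operations per second, in millions) is standard.

```python
def mflops_rate(operation_count, solver_seconds):
    """Millions of floating-point operations per second, counting adds,
    subtracts, multiplies and divides alike."""
    return operation_count / solver_seconds / 1.0e6


# Hypothetical run: 5.3 million operations completed in one second.
rate = mflops_rate(5_300_000, 1.0)
```

Unlike a per-equation time, this figure does not change when the same amount of arithmetic is spread over more or fewer equations, which is why it is stable across sparsity patterns.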

The vector solver is more dependent on the matrix
structure and vector length (instances). The speed is
between 14.5 and 17.8 Mflops, which is impressive but is
below that predicted by Calahan [6]. As in the case of
SPICEV the speedup for the MOSFET circuit is lower, 3.7,
compared to 3 in the case of the Adder4 and 8 for the
Adder16. Changing the mix between equations and devices by
introducing parasitic series resistances in the models
brings about higher speedups, as predicted by Eq. 2. The
three numbers given in the speed-up column in Table 3 for
bipolar circuits correspond to one series resistance in the
base, two in the base and collector, and three in the base,
collector and emitter, respectively. The two data for the
MOS circuits are with and without parasitic drain and
source resistances. The speed-up of 10 for the Adder* with
two parasitic resistances is due to an additional
reordering of the circuit equations performed by SPICE2,
which considerably increases the number of fill-ins
compared to CLASSIE.

Table 4

In Table 4 the results of the choice of data structures,
two-level analysis and other features of CLASSIE are
compared with SPICEV task by task. From the above data it
can be seen that the routines which perform gathering of
parameters, initialization and scattering of matrix terms
from and to memory take more than one third of the total
time in SPICEV because of the sequential mode in which
they are executed. The percentage for the above tasks also
includes the contribution of the device linearization
control. In CLASSIE, however, the analytical model
evaluation has become dominant, as desired. If analytical
models are used, the maximum speedup is already in the
program for this part.

In CLASSIE the percentage of time spent to link the
subcircuits with the interconnection matrix is of the order
of 5-10%.

Another important observation concerns the overhead time.
From simulation runs it can be noticed that as the main
parts of the analysis (device evaluation and equation
solution) are made faster, the importance of the overhead
grows. The major part of the overhead, contributed by
memory manager operations (moving blocks around) in
SPICEV, is reduced to less than 5% in CLASSIE. Another
source of concern is the time-step control computations.
The adders use iteration-count while the filters use
truncation-error time-step control. There is not much time
spent in the truncation error evaluation, since it has been
vectorized. Its contribution is down from 15% in early
versions of SPICEV to less than 2% in CLASSIE.

A major problem for CLASSIE can be a reduced
number of occurrences of identically structured
subcircuits, which will increase the time for computation.
The Lowpass section, which achieves a vector length of only
two in most vectorized code, is simulated at half the speed
of SPICEV. Defining long vectors in model evaluation,
however, reduces the run time on CLASSIE to that on SPICEV.
This result in itself is very important because in the
worst case of only 2 instances of a cell the two-level
analysis of CLASSIE is as fast as a vectorized SPICE. The
performance of CLASSIE is expected to be superior for more
than two instances of each cell type.


The use of table models for devices will reduce by up
to 5-10% the overall 30-40% which the evaluation time of
analytic equations contributes to the analysis time. The
major advantage of table models in this context is that the
same sequence of operations is performed for all devices
regardless of the operating region. For a Mflops machine
the use of table models for speed is of secondary
importance. This is proven by a run of the Filter benchmark
using the simple Shichman-Hodges model for MOSFETs.
Because this model is so simple, its use provides a good
estimate of table-lookup for dc characteristics. The
percentage time spent in the 'Device Equation' part of
Table 5, which computes the conductances associated only
with the modeling of the transport in the inversion
channel, is reduced to 3% from 17.3%. This simple routine
achieves approximately 60 Mflops.

5. Conclusion

The results presented in the last section suggest that
the speed-up which CLASSIE offers compared to Fortran
SPICE2 on the same computer is a function of the circuit.
The two bounds on performance improvement are the
model evaluation speed-up for a simulation dominated by
device linearization and the equation solution speed-up
when this part is percentage-wise the most important.

CLASSIE can run on other vector and scalar computers
but will not achieve the same speed-up relative to SPICE2
as on the CRAY-1. It is estimated that changes in the data
storage are necessary for optimal performance of CLASSIE
on the CYBER 205.

In conclusion, a total speedup of up to an order of
magnitude can be predicted for a circuit with over one
thousand devices (nodes) simulated on CLASSIE relative to
SPICE2 running on the CRAY-1.

Acknowledgement 

This work has been sponsored by a grant from Bell
Laboratories Inc. Also acknowledged are the research
grants ARO DAAG29-81-K-0021 and AFOSR 62-002. The
CRAY computer time made available by United Information
Services is greatly appreciated.

6. References

[1] N. G. B. Rabbat, A. L. Sangiovanni-Vincentelli, and
H. Y. Hsieh, "A Multilevel Newton Algorithm with
Macromodeling and Latency for the Analysis of Large-Scale
Nonlinear Circuits in the Time Domain", Trans. IEEE,
Vol. CAS-26, Sept. 1979.

[2] P. Yang, I. N. Hajj, and T. N. Trick, "SLATE: A Circuit
Simulation Program with Latency Exploitation and Node
Tearing", Proc., Int. Conference on Circuits and
Computers, New York, Oct. 1980.

[3] E. Lelarasmee, A. E. Ruehli, and A. L.
Sangiovanni-Vincentelli, "The Waveform Relaxation Method
for Time Domain Analysis of Large Scale Integrated
Circuits", ERL Memo No. M81/75, June 1981.

[4] A. Vladimirescu and D. O. Pederson, "A Computer
Program for the Analysis of LSI Circuits", Proceedings,
IEEE International Symposium on Circuits and Systems,
Chicago, Illinois, April 1981.

[5] A. Vladimirescu and D. O. Pederson, "Performance
Limits of the CLASSIE Circuit Simulation Program",
Proceedings, IEEE International Symposium on Circuits
and Systems, Rome, Italy, May 1982.

[6] D. A. Calahan, "Multi-Level Vectorized Sparse Solution
of LSI Circuits", Proceedings, Int. Conference on
Circuits and Computers, New York, Oct. 1980.




ACKNOWLEDGEMENTS 

I would like to thank Professor D. O. Pederson, who has given me the
opportunity to work on this topic as a post-doctorate research associate.
The interaction with the members of the Computer-Aided-Design group, in
particular discussions with Tom Quarles and Clem Cole, is acknowledged.

The helpful suggestions and support with up-to-date documentation
from Ed Kushner and Steven Nakamoto of Floating Point Systems are
greatly appreciated.

The support of the research grants AFOSR-81-0021 and SRC-82-11-008,
and the interest of Analog Devices Semiconductor have been essential to the
work presented in this report.


LSI Circuit Simulation on Attached Array Processors

Andrei  Vladimirescu 

ABSTRACT 

The simulation of Large-Scale-Integrated (LSI) circuits requires very
long run times on conventional circuit analysis programs such as SPICE2 and
super-mini computers. A new simulator for LSI circuits, CLASSIE, which
takes advantage of circuit hierarchy and repetitiveness, and array
processors capable of high-speed floating-point computation are a promising
combination.

The program development software environment of the Floating Point
Systems 164 is evaluated based on the experience gained with the conversion
of both SPICE2 and CLASSIE to this machine. The FPS-164 has been used as
an attached processor to a VAX 11/780 with the UNIX operating system.

The performance of the two simulation programs on the host computer,
the VAX, and the attached processor is compared. The FPS-164 architecture
and Fortran compiler are evaluated by means of the speedup of CLASSIE
compared to SPICE2 on the same processor.


CHAPTER  1 


INTRODUCTION 


The simulation of Large-Scale-Integrated (LSI) circuits requires very
long run times on standard circuit analysis programs such as SPICE2 and
standard hardware of the super-mini or main-frame computer class (0.5 to 2
Mips). A new simulator for LSI circuits, CLASSIE, has been developed
recently [Vlad82] which is more efficient and preserves the same accuracy.
This report describes the experience and results obtained when adapting
SPICE2 and CLASSIE to a commercially available array processor, the
Floating Point Systems 164, attached to a super-mini host computer, the VAX
11/780. As brought out later, Cole [Cole83] has implemented a first version
of SPICE2 on the FPS-164 attached to a VAX 11/780 with the UNIX operating
system.

SPICE was developed over a decade ago for typical SSI circuits and
scalar computers of the time. The program operates on an entire circuit,
which is processed at the individual electrical element level. Two basic
factors of present technology have been considered in the design of the new
LSI circuit simulator, CLASSIE. The first one is that LSI circuits are
usually a collection of a limited number of structurally identical
functional blocks such as logic gates, operational amplifiers, etc. The
second factor is the availability of parallel computer architectures which
provide an ideal environment for fast computations on repetitive
structures. The analysis in the new program takes into consideration the
hierarchy of the LSI circuit. The identical functional blocks are grouped
together and the simulation is performed at two levels.

The above design considerations speed up the simulation of an LSI
circuit performed by CLASSIE by up to an order of magnitude compared to
SPICE2 on a CRAY-1 super (vector) computer. From the point of view of the
simulation speed for a large circuit on a vector computer, CLASSIE rates
between SPICE2 and a timing simulator.

The parallel architecture of the FPS-164 attached array processor is
conceptually different from the CRAY-1; computationally intensive codes can,
however, be sped up following the same basic concepts as in the case of the
CRAY-1. The floating-point computation rates of the CRAY-1, the FPS-164
and the VAX 11/780 with a floating-point accelerator are 160, 12, and 1
Mflops, respectively. The speeds specified for the vector and array
processors are estimates based on the assumption that more than one
operation is processed at the same time. Thus, as a rule of thumb, a
computationally intensive program such as a circuit simulator should run as
many times faster on the parallel processors as specified by the raw
speedup if the implementation takes full advantage of the architecture.

A general overview of the FPS-164 array processor (AP) is presented in
Chapter 2. After a brief description of the architecture, a critical view
of the system and program development software available on the AP is
presented.

Chapter 3 provides a closer look at the details of porting two circuit
simulators, SPICE2 and CLASSIE, to the FPS-164. SPICE has been developed
over the past 14 years with no specific computer architecture in mind, while
CLASSIE provides the same algorithms as the former program tailored for
parallel processing.




A performance evaluation of the two programs follows. The execution
speed of SPICE2 is compared to that on a general super-mini computer such
as the VAX-11/780, while the speedup due to parallelism is emphasized for
CLASSIE.

Conclusions on the implementation and performance of circuit
simulation programs on the FPS-164 are the subject of Chapter 5.

The work described in this report has been performed on an FPS-164 AP
running the 'D' software release attached to a VAX 11/780 running release
4.1c BSD of the UNIX operating system.


CHAPTER  2 


The FPS-164 Attached Processor


2.1.  Introduction 

This chapter provides a brief description of the FPS-164 processor. The
architecture is outlined first with emphasis on the parallel processing
features.

From a programmer's point of view the most important means to benefit
from the architectural capabilities of a computer is its software
environment. The second section takes a critical look at the two operation
modes of the AP and the system and program development software. The main
components of the program development software, e.g., the fortran compiler,
debugger, mathematics library, etc., are evaluated. The experience gained
from porting SPICE2 and CLASSIE to the FPS-164 is commented on wherever
appropriate.

2.2.  Hardware

The term array processor identifies a single peripheral processor with
high-speed floating-point computation capability which can be attached to a
general-purpose computer system. The tandem combination usually provides
a much higher computation power than the host alone. Although the
architectural synopsis and name can cause confusion with the vector
computers, the term array processor refers to a distinct category of
pipelined Single-Instruction-Multiple-Data (SIMD) processors.

The Floating Point Systems AP-120B and FPS-164 are examples of
commercially available array processors. The former is limited by a 38-bit
word while the latter is better suited for scientific applications where a
64-bit data word is necessary. The architectural features [Char81] include
multiple (eight) functional units, multiple (seven) high-speed data paths,
two data register units of 32 registers each, up to 7.25 Mword main memory
where data and instructions are stored separately, and a 167 ns cycle time.
The functional units allow a maximum of two data computations, two memory
accesses, an address computation, four data register accesses, and a
conditional branch to be initiated in a given CPU cycle.

The processor achieves performance through parallelism and/or
pipelining. A short pipe, 2 stages for the add and 3 for the multiply unit,
characterizes the FPS-164. This design matches the clock cycle time and
explains the difference in performance compared to the faster vector
computers, CRAY-1 and CYBER 205. The short pipe has the advantage of
providing most of the computation speed for a relatively short vector
length [Vlad82].


2.3.  Software Environment

The two major components of the processor software are the system
software used at run time and the program development software which
assists the conversion of a high-level language code into an executable
module. The specifics of both components of the AP software are outlined in
the following two sections.


2.3.1.  System Software

There are two major operating systems available for the FPS-164, the
Attached Processor Executive, APEX, and the Single Job Executive, SJE. The
two operating systems correspond to the two basic approaches of using the
AP. Programs executing under APEX perform certain tasks on the host
computer and other tasks on the AP. Input and output routines, which
interact with the user and perform more character-string operations than
floating-point operations, can be effectively run on the host. The
computation-intensive parts of the program will, however, run fastest on
the AP. APEX controls the timely transmission of data between host and AP
during the execution of the program.

Programs executing under SJE run on the AP only. The executable
module together with the relevant data files are transferred to the AP
before a run is initiated. Upon completion of the job the files of interest
are transferred back to the host computer.

The versions of SPICE2 and CLASSIE converted to the FPS-164 run under
SJE only. The AP works together with a VAX running the UNIX operating
system.

2.3.2.  Program Development Software

The software available for program development includes a fortran
compiler, APFTN64, a linker, APLINK64, an object module librarian,
APLIBR64, a symbolic debugger, APDEBUG64, an assembler, APAL64, and a
mathematics library, APMATH64.

2.3.2.1.  APFTN64

APFTN64 is a cross compiler which runs on the host computer and
produces instructions which are executed on the attached processor. This is
basically an F77 compiler with a number of extensions intended to utilize
the parallel/pipelined architecture of the processor. There are several
ways a programmer can take advantage of the architecture. One approach is
through 5 different levels of optimization provided by the compiler.

OPT=0 implies the simplest compiler action where each fortran
statement is treated individually; experience has been that at this level a
program always works once it is operational.

OPT=1 signals the compiler that it can consider blocks of statements at
one time for generating machine code; a block consists of consecutive
statements which finish in a 'jump' or I/O instruction.

OPT=2 enables the compiler to try a global optimization across
statement blocks as defined above.

OPT=3 adds pipelining to the above optimizations, which exploit only
parallelism: multiple elements of an array are processed by setting up one
or two pipes through the functional unit(s).

OPT=4 is defined as 'unsafe code motion' and consists in moving
invariant expressions outside the body of a DO loop. As long as no
'zero-trip' loops occur in the program this level of optimization may
provide an additional few percent of speed improvement.
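Why moving an invariant expression out of a loop is 'unsafe' for zero-trip loops can be shown with a small sketch. Python is used here as a stand-in for fortran, the function names are hypothetical, and the division by zero is an assumed example of an expression that faults when evaluated; the text does not state which failure the compiler documentation has in mind.

```python
def scale_in_loop(values, num, den):
    """Invariant expression left inside the loop body: it is evaluated
    only if the loop executes at least once."""
    out = []
    for v in values:
        out.append(v * (num / den))
    return out


def scale_hoisted(values, num, den):
    """Invariant expression moved in front of the loop: it is evaluated
    even when the loop makes zero trips."""
    factor = num / den
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

With a non-empty input both versions return the same result, and the hoisted one saves a division per element. With an empty input and a zero denominator, `scale_in_loop` returns an empty list while `scale_hoisted` raises `ZeroDivisionError`.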

The approach for writing fortran code which takes advantage of the
architecture is similar to the guidelines followed for other parallel
machines, e.g., the CRAY-1 [Vlad82]. A 'well-behaved' DO loop in which
operations with array elements are performed is translated on all machines
into a 'vector operation'. The difference is that on the CRAY-1 the
elements of an array are loaded into hardware vector registers and a vector
operation is performed, whereas on the FPS-164 a 2-3 stage pipeline is set
up through the functional units.

Release D of the fortran compiler, which has been used in this project,
has been found to generate incorrect code for OPT≥2. A typical symptom is
that the attached processor hangs and cannot be initialized unless the host
computer is rebooted. The compiler seems to fail to interpret correctly
loops based on test and jump. Working code has, however, been generated for
'well-behaved' DO loops.

In some cases even OPT=1 can produce wrong code. The approach to
tracing back the latter case is to locate the routine which does not
execute properly and recompile it with a lower level of optimization. This
failure mode does not hang the machine; it results just in an erroneous
behaviour of some routines, e.g., SPICE2 prints an error message for a
perfectly valid statement.

A useful option of the APFTN64 fortran compiler is the one which turns
off the overflow/underflow interrupts generated during the execution of a
user program. Unless this option is used for some of the device routines,
SPICE2 aborts when an underflow occurs.

Another criticism of APFTN64 when compared to another parallel
processor fortran compiler, viz., the CRAY CFT [Cray80] fortran compiler, is
its noncommunicative nature. No reports are provided to the programmer on
the action taken on different loops or program blocks which can be
converted into parallel code.


2.3.2.2.  APLIBR64, APLINK64, APDEBUG64


APLIBR64 is a useful utility for creating an object program library. For
large programs consisting of tens of modules it is a convenient way to store
the valid object modules and to replace only the ones which have been
changed.

APLINK64 is used to produce the executable module, called the '*.img'
file by convention. The linker accepts both individual object files and
object libraries. A problem encountered with APLINK64 is the erratic
termination message of a bad block encountered in an object module which
was successfully compiled and added to the library. This problem has been
cured every time it has occurred by recompiling the flagged module and
recreating the library.

A relevant option for the linker is -SYM, which generates a symbol table
needed by the symbolic debugger.

The symbolic debugger, APDEBUG64, is a very useful tool for program
development. It is a quite powerful debugger, similar in its description to
the fortran debugger running under the VMS operating system. An accurate
trace back including line numbers in the pertinent fortran files can be
obtained. Some of the other features, e.g., examining values of local and
global variables, setting breakpoints, etc., could not be tested due to
difficulties encountered with opening the symbol table file. The
documentation is very vague on this subject and various sensible approaches
have led to the same debugger message of not finding the symbol table file.
In these situations it has been found to be faster to use just the trace
back.

A conceptual drawback of the debugger is that it can be used only for
modules compiled entirely with OPT=0. This restriction deprives the user of
any possibility of debugging parallel code, which is the primary objective
for this processor.

2.3.3. APMATH64

APMATH64 is a collection of mathematical functions which operate on
arrays and scalars. In a number of situations it is advantageous to use these
efficiently coded vector routines. These functions prove effective only when
the vector length is sufficient to offset the start-up time of the routine. The
programmer must judge this on a routine-by-routine basis from the time
spent per array element. For example, VADD, which adds the elements of two
arrays and stores the result in a third array, needs 15-30 elements per
array to achieve 50% efficiency in the vector computation. In other
words, it takes that many elements for the computation time to equal
the setup time of the function.
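
The break-even argument can be made concrete with a simple timing model. The
sketch below assumes that the routine time is a fixed start-up cost plus a
constant cost per element. The function names and the start-up cost of 20
element times are illustrative assumptions chosen inside the 15-30 range
quoted above; they are not measured FPS-164 figures.

```python
def vector_efficiency(n, startup_elems):
    """Fraction of the routine time spent on useful per-element work.

    Assumes time(n) = t_startup + n * t_elem, with the start-up cost
    expressed in element times (startup_elems = t_startup / t_elem).
    """
    if n <= 0:
        raise ValueError("vector length must be positive")
    return n / (n + startup_elems)


def break_even_length(startup_elems, target=0.5):
    """Smallest vector length whose efficiency reaches `target` (0 < target < 1)."""
    # n / (n + s) >= target  <=>  n >= s * target / (1 - target)
    n = startup_elems * target / (1.0 - target)
    return int(n) if n == int(n) else int(n) + 1


# Hypothetical start-up cost equal to 20 element times.
STARTUP = 20
half_length = break_even_length(STARTUP)
```

Under this model the 50% point is reached exactly when the vector length
equals the start-up cost measured in element times, which is the statement
made in the text.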

CHAPTER 3


SPICE2 and CLASSIE on the FPS-164


3.1. Introduction

A major application of the array processor is in the area of circuit
simulation.

The problems encountered during the implementation of SPICE2 on the
FPS-164 are outlined. Although SPICE2 could not be compiled at a higher
optimization level than 1, its performance is very close to that of a
commercially available program which is another version of the same code
tuned for the FPS-164.

The results obtained in porting CLASSIE to the AP are very encouraging.
The programming style used in CLASSIE is geared towards parallel
architectures and thus the critical parts could be compiled successfully at the
highest optimization level on a Fortran compiler still under development. A
factor of two speedup has been achieved over SPICE2 running on the FPS-164
for a representative medium-size circuit, a four-bit adder.

In this chapter a number of data on CLASSIE and SPICE2 are presented.
These numbers are obtained from runs on both scalar and vector computers.
SPICE2 performs sequential operations on both types of computers and the
speedup stems from the differences in computer architectures. All data
which refer to CLASSIE reflect a sequential execution of statements on a
scalar computer and parallel execution on a vector computer or array
processor. For a small circuit of the basic cell type, e.g., a logic gate or an
operational amplifier, the only difference between CLASSIE and SPICE2 is a
different data organization, which becomes a source of speed difference.

3.2. SPICE2

3.2.1. Implementation Notes

The first program to be implemented on the FPS-164 attached to a VAX
11/780 running UNIX has been SPICE2 [Nage75], [Cohe76], [Vlad81], [Cole83].
In the following paragraphs a UNIX operating system is assumed for the VAX
unless specified otherwise. This provision is important because SPICE2
compiled with the VMS Fortran compiler runs roughly twice as fast as when
compiled with the UNIX f77 compiler. Cole, in his work with the FPS-164, has
not been concerned primarily with the simulator performance; the reported
speedup of 3 for a typical circuit such as the UA741 has been obtained by
compiling the program with APFTN64 using OPT=0. This version of SPICE2
runs on the AP under SJE, the Single Job Executive.

The next step in porting SPICE2 to the AP has been to recompile the
entire program using OPT=1. The executable generated in this way did not
run properly, causing messages such as 'LESS THAN TWO CONNECTIONS AT
NODE X' to be printed for a perfectly correct input. It has been found that
by selectively recompiling the subroutines which perform the I/O in SPICE2,
viz., READIN, RUNCON, DCOP, OVTPVT, PLOT, with OPT=0 while preserving the
code of all other routines at OPT=1, a working executable can be obtained.
Typically this code, which is referred to as an 'OPT=1' version in spite of the
above idiosyncrasies, runs twice as fast as the 'OPT=0' SPICE2. The size of
the 'image file' is reduced by one third from roughly 1.6 Mbytes to 1.2
Mbytes.

The attempt to use OPT=2 for just the computation-intensive routines
such as the device model routines failed. The code generated in this way
would typically hang the attached processor with no possibility of recovery
short of rebooting the VAX. The only routines which have been successfully
compiled at an optimization level higher than 1 are the equation solution
routines, DCDCMP and DCSOL. Both have been compiled with OPT=3 and a
working SPICE2 version has been generated. The speed improvement over
the above 'OPT=1' version has been less than 10%. This latter SPICE2 version
is referred to as 'OPT=1' in Tables 3.1 and 3.2.

The best performance ever reported for SPICE2 on the FPS-164 is that of the
commercially available program QSPICE [Shan83], which is typically 1.3 times
faster than the best code obtained in this work. It is believed that for
obtaining the above performance a number of the SPICE2 routines had to be
rewritten to overcome the deficiencies in the APFTN64 compiler and to obtain
correct code for OPT=3. Another difference in QSPICE is that the linear
equation solving routines have been coded in APAL64, the FPS-164 assembly
language. The small advantage in speed for QSPICE over SPICE2 proves that
no compute-intensive part of the program can be pipelined. This difference
stems mainly from a better control of the operand flow in the sparse
equation solution coded in APAL64.

3.2.2. Performance

Table 3.1 summarizes the execution times of SPICE2 for four examples.
The numbers given represent the time in cpu seconds needed for the
transient analysis. The UA741 and Adder4 are bipolar circuits while MOSAMP2
and DECODER are an NMOS operational amplifier and a binary-to-octal
decoder. A LEVEL=2 device model has been used in the analysis of the latter
two circuits.


As a general remark on the performance improvement on the attached
processor it can be stated that SPICE2 runs up to an order of magnitude
faster than on a VAX 11/780 with floating-point accelerator and UNIX. For the
two bipolar circuits the runs are typically 8 times faster and for the
MOS circuits 12-15 times faster. The difference between bipolar and MOS
circuits can be explained by the much larger percentage of time spent in the
model evaluation for the latter compared to the former. The model evaluation
seems to benefit more on the AP than the equation solution.
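
The explanation above can be checked with a two-part run-time model. In the
sketch below the run time is split into model evaluation and equation
solution, each accelerated by its own factor. All numbers (time fractions of
0.5 and 0.9, factors of 16 and 4) are hypothetical values picked only to
illustrate the trend; the report does not give this breakdown.

```python
def overall_speedup(model_fraction, model_speedup, solve_speedup):
    """Combined speedup when two program parts accelerate by different factors.

    model_fraction is the share of the original run time spent in model
    evaluation; the remainder is attributed to the equation solution.
    """
    if not 0.0 <= model_fraction <= 1.0:
        raise ValueError("model_fraction must lie in [0, 1]")
    new_time = (model_fraction / model_speedup
                + (1.0 - model_fraction) / solve_speedup)
    return 1.0 / new_time


# Hypothetical factors: model evaluation 16x faster, equation solution 4x faster.
bipolar_like = overall_speedup(0.5, 16.0, 4.0)  # half the time in the models
mos_like = overall_speedup(0.9, 16.0, 4.0)      # models dominate the run time
```

With these assumed inputs the circuit that spends more of its time in model
evaluation shows the larger overall speedup, which is the qualitative
behaviour reported for the MOS examples.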

3.3. CLASSIE

3.3.1. Implementation

The implementation of CLASSIE has been helped by the experience
gained from the SPICE2 conversion.

As a first step the VAX/UNIX version of CLASSIE has been implemented;
this version differs from the high performance CRAY-1 version only in the
model evaluation routines, which do not take advantage of vectorization. This
version had the same limitation on the optimization level used for the
semiconductor device routines as SPICE2.

The next step included the conversion of the diode and bipolar
vectorized model routines used on the CRAY-1 for the FPS-164. Conceptually the
'well-behaved' DO loops of the CRAY-1 CLASSIE code should produce
equally efficient code on the AP.

A first factor affecting the performance has been the multiple branching
used for the multiple expressions of the semiconductor-device behaviour.
The usage of the vector merge function 'CVMGx' on the CRAY-1 has been
replaced by IF statements inside the DO loop. An equivalent CVMGx statement
function [Mait83] could have been used, which would have contributed
a 10-15% speed improvement in the device-evaluation speed. This
improvement is estimated based on a typical vector length of 30.
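
The difference between the two coding styles can be sketched as follows. The
example contrasts a loop containing an IF per element with a version that
evaluates both expressions for every element and then selects the result with
a merge function. The helper `cvmgt` follows the argument order of the CRAY
Fortran CVMGT intrinsic (value if true, value if false, mask). The two-region
diode expression and its parameter values are simplified assumptions for
illustration and are not the CLASSIE device model.

```python
import math


def cvmgt(if_true, if_false, mask):
    """Element-wise merge: if_true[i] where mask[i] holds, else if_false[i]."""
    return [t if m else f for t, f, m in zip(if_true, if_false, mask)]


def diode_current_branching(voltages, i_sat=1e-14, v_t=0.02585, v_crit=-0.1):
    """One IF per element inside the loop (the style used in the FPS-164 port)."""
    currents = []
    for v in voltages:
        if v > v_crit:
            currents.append(i_sat * (math.exp(v / v_t) - 1.0))
        else:
            currents.append(-i_sat)
    return currents


def diode_current_merged(voltages, i_sat=1e-14, v_t=0.02585, v_crit=-0.1):
    """Branch-free loops: compute both expressions, then merge by mask."""
    forward = [i_sat * (math.exp(v / v_t) - 1.0) for v in voltages]
    reverse = [-i_sat for _ in voltages]
    mask = [v > v_crit for v in voltages]
    return cvmgt(forward, reverse, mask)


sample = [-5.0, -0.05, 0.0, 0.6]
same = diode_current_branching(sample) == diode_current_merged(sample)
```

The merged form does redundant arithmetic, but each of its loops has no
branch in the body, which is what allows a vector unit or a pipelining
compiler to process it efficiently.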

A second factor has affected the performance of CLASSIE on the FPS-164
more significantly. It is known as the 'potential data dependency' problem,
which prohibits vectorization (pipelining) of a DO loop. Both in SPICE2 and
CLASSIE all circuit data are managed in a large block of memory defined as
an array VALUE(maximum_available_data_memory). Different data can be
distinguished by table pointers. The compiler, however, does not know that
there is no interaction between the data in two different tables within the
same array. On the CRAY-1 there is a 'force vectorization' statement which
can be placed in front of a loop. Release 'D' of APFTN64 does not have this
feature. This problem could be noticed as soon as the most time-consuming
modules had been compiled with OPT=3: there was no spectacular jump in
performance, which is expected when pipelining takes place. The speed
improvement is between 2-4 per DO loop at OPT=3 compared to OPT=2. In
the simpler forward and back substitution routines for the subcircuit
matrices the above problem has been overcome by using the APMATH64
vector functions. This has resulted in a 23% speed improvement for this
portion of the code only. The vector length for the above number is 36.
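
The dependency problem can be illustrated with a small sketch. Two tables
live in one flat array and are addressed through start offsets known only at
run time. If the destination table overlaps a source table, later iterations
read values written by earlier ones, so the loop cannot be executed as one
vector operation. The function names, the list standing in for VALUE and the
offsets are assumptions made for this illustration.

```python
def tables_overlap(start_a, start_b, length):
    """True if two equally long tables in one flat array share any slot."""
    return start_a < start_b + length and start_b < start_a + length


def add_tables(value, dst, src1, src2, n):
    """Compute value[dst+i] = value[src1+i] + value[src2+i] for i = 0..n-1."""
    if tables_overlap(dst, src1, n) or tables_overlap(dst, src2, n):
        # Possible dependency: keep strict element-by-element order.
        for i in range(n):
            value[dst + i] = value[src1 + i] + value[src2 + i]
    else:
        # Independent tables: safe to treat the loop as one vector operation.
        value[dst:dst + n] = [a + b for a, b in
                              zip(value[src1:src1 + n], value[src2:src2 + n])]


# Disjoint tables at offsets 0, 4 and 8 of a 12-word array.
value = [float(i) for i in range(12)]
add_tables(value, 8, 0, 4, 4)

# Overlapping tables: each result feeds the next iteration.
recurrence = [1, 1, 0, 0, 0]
add_tables(recurrence, 2, 0, 1, 3)
```

A compiler sees only the single array and the offset variables, so without a
run-time test or a directive from the programmer it has to assume the
overlapping case and generate the sequential loop.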

It is believed that all semiconductor-modelling routines could be
compiled at OPT=4 in CLASSIE because of the programming style, 'well-behaved'
DO loops, and regular data structures. The equivalent routines in SPICE2
could not be compiled correctly for OPT>1. The equation-solving routines in
CLASSIE could be compiled at OPT=3 maximum.

The most aggravating experience during the implementation of CLASSIE
has been the fact that user data can overwrite the AP's system software
components or buffers thereof if there is not sufficient memory for loading the
user program. In such cases a message from the linker or SJE would be
helpful instead of getting a trace back leading into the system routines.

3.3.2. Performance

A running version of CLASSIE compiled with OPT=1 has been obtained in
a similar way as SPICE2. On any computer, in scalar mode, CLASSIE gains
15-25% in speed over SPICE2 for medium circuits in transient analysis. Even
for a small circuit, such as the UA741, CLASSIE is 20% faster than SPICE2 on
the attached processor due to more regular data structures and the possible
optimization associated with them.

In the DC operating point analysis CLASSIE is typically twice as fast as
SPICE2 on medium circuits. The additional reordering process in DC analysis
is performed on the interconnection matrix and one subcircuit matrix for each
subcircuit type in CLASSIE, rather than on a large overall matrix for the
entire circuit as in SPICE2. In transient analysis there is no reordering and
this explains the smaller speed difference. The same speed ratios between
CLASSIE and SPICE2 are also found on the FPS-164. The ratio between CLASSIE
compiled with OPT=1 and OPT=0 is also about 2 on the array processor, as in
the case of SPICE2.

Table 3.2 lists the effects of the different optimization levels used in the
compilation of SPICE2 and CLASSIE [Vlad83]. The times in seconds are for a
transient analysis of the bipolar 4-bit adder circuit of 288 semiconductor
devices, 451 equations or 96 NAND subcircuits. The transient analysis has
been performed from 0 to 350ns using the same input waveforms as
described in [Vlad82]. It should be noticed that SPICE2 could not be
compiled successfully at a higher optimization level than 1.

Table 3.3 shows the speedup which is obtained by running CLASSIE on
the CRAY-1 and on the FPS-164. The speedup numbers are relative to the
performance of SPICE2G5 on the same hardware. The overall speedup on
the FPS-164 could conceivably be improved to 3 if machine code generation
were implemented for the linear equation solution. The speedup in the
device-evaluation part is estimated to be better if the Fortran compiler of
the systems software release 'E' is used. This latest version of the compiler
is advertised to have better pipelining capabilities than the earlier versions.
All the above factors can narrow the gap of the speedup ratio between the
CRAY-1 and the FPS-164 to roughly 1.5 in favor of the former.


CHAPTER 4


CONCLUSION 


The evaluation of circuit simulation on a commercially available array
processor has been the purpose of the work presented in this report. Both
the better known SPICE2 simulator and the prototype simulator CLASSIE for
LSI circuits have been ported to the FPS-164 array processor.

The FPS-164 is a promising processor for 64-bit floating-point scientific
computations from a hardware architecture point of view. The experience
gained porting the above mentioned programs shows that the available
system software and program development software is relatively unfriendly and
not sufficiently debugged. The reported work has been carried out using the
Single Job Executive (SJE); under SJE the application program runs solely on
the AP. Large scientific programs intended to run on the AP are written in
Fortran; only a solid and well-debugged Fortran compiler will enable the user
to take advantage of the speed offered by the underlying architecture.

The performance of the two programs on the AP is noteworthy. SPICE2
has been found to run 5-14 times faster on the AP than on a UNIX VAX
11/780 with floating-point accelerator. This ratio is between 3 and 7
relative to the same VAX running VMS. CLASSIE runs roughly twice as fast as
SPICE2 on the AP, which brings the ratio between CLASSIE on the AP and
CLASSIE on VAX/VMS close to 12; this is also the ratio between the Mflop
rates of the two computers.

ABSTRACT

While a large number of powerful design verification tools have
been developed for IC design at the transistor and logic gate
levels, there are very few silicon-oriented tools for architectural
design and evaluation. As the number of gates which can be
implemented on a single chip grows, these tools are becoming
increasingly important.

The FTL2 system described in this paper is an interactive
system for specifying concurrent digital systems and analysing
their behavior. FTL2 differs from other behavioral-level
simulation systems in that the input specification for a circuit is a
concurrent program. Specifications are incrementally compiled
into augmented data-flow graphs which are then interpreted by a
software data-flow machine.

FTL2 includes special control structures for describing
concurrent behavior in a structured fashion, a number of
user-oriented input features, and an extensive macro facility. The
concept of non-shareable resources is used to determine
timing-dependent module access conflicts in a value-independent
manner. Incomplete specifications can be emulated and can be
modified interactively.

FTL2 has been implemented in LISP and is currently
operational. A companion FTL2-based synthesis system is currently
under development.


1. Behavioral-Level Simulation and Synthesis

Computer aids have been used with great success in several
stages of the integrated circuit design process. However,
although many tools have been designed for electrical and logic
level simulation and for automatic layout of semi-custom chips,
little support is available for the beginning stages of a digital
system design. Computer aids for behavioral level specification and
synthesis have been developed, but they are not widely used in the
IC industry. One reason is that many existing behavioral
simulators do a poor job of abstracting concurrent behavior. Therefore,
they provide little more information than can be obtained from a
sequential description in a conventional programming language.
In many cases this difference is not enough and every time a new
system or chip is designed a new simulator is designed with it.

The FTL2 system addresses this problem by allowing
specifications that combine the characteristics of control-flow
and data-flow models to allow users to accurately evaluate the
concurrent algorithms that a digital system represents before
choosing a detailed implementation for it.

2.  Describing  Digital  Systems 

Digital systems can be described by net and component lists,
by logic equations, by sequential or concurrent programs, by
data-flow graphs or by systems of constraints. These methods
differ in the degree to which they specify the structure of a
specific implementation.

Standard electrical and logic simulators use the net and
component list method. This type of description gives the
structure of a particular implementation explicitly, but leaves the
behavior of that system largely implicit. At the other extreme is
the method of specifying digital circuits as a system of
constraints. These constraints specify what the system accepts
as input and produces as output, but leave the procedures it goes
through and the structure of any specific implementations
implicit. The specification method provided by FTL2 lies in between
these extremes. In FTL2, systems are described by programs for a
software implementation of an augmented data-flow machine. In
the data-flow model of computation, programs are described by
directed graphs. The vertices of the graphs represent
combinational functions; the edges of the graph represent
communication paths between these functions. Augmented data-flow is an
extension of the data-flow model which allows storage at the nodes
and allows more general node firing rules. Actually, this
model is closer to the dependence model and to
some of the existing data-flow machines than it is to the
classical data-flow model.

The text representation of FTL2 specifications is a sequence
of functions and control structures called forms. Specifications
are compiled into data-flow graphs by examining each form in
sequence, determining the type of object the form represents,
calling a function to compile the form and repeating the process
recursively for each form it contains. These data-flow graphs are
then passed to a software data-flow machine for evaluation.
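
The recursive compilation step can be sketched as follows, assuming the
parenthesized form syntax used by FTL2. The sketch reads one form from text
and turns it into a tree of nodes. The function names and the dictionary
layout of a node are assumptions made for illustration; they are not the
FTL2 compiler's internal representation.

```python
def tokenize(text):
    """Split a parenthesized form into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read_form(tokens):
    """Consume one form from the token list and return it as a nested list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == ")":
        raise SyntaxError("unexpected )")
    if token != "(":
        return token
    elements = []
    while tokens and tokens[0] != ")":
        elements.append(read_form(tokens))
    if not tokens:
        raise SyntaxError("missing )")
    tokens.pop(0)  # discard the closing parenthesis
    return elements


def compile_form(form):
    """Compile a form into a tree node, recursing into the forms it contains."""
    if isinstance(form, str):
        return {"kind": "leaf", "name": form}
    if not form:
        return {"kind": "node", "op": None, "args": []}
    return {"kind": "node", "op": form[0],
            "args": [compile_form(element) for element in form[1:]]}


tree = compile_form(read_form(tokenize("(+ (element i a) (element j b))")))
```

Because every nested form becomes a child of the form that contains it, the
result is a tree, which matches the tree-shaped graphs discussed in the
simulator implementation section.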

The FTL2 software data-flow machine has nodes for primitive
functions, for storage, and for sequential and parallel control. The
concurrency provided is synchronous and deterministic; therefore,
explicit synchronization of parallel processes is not required,
and systems ranging from synchronous and deterministic to
asynchronous and non-deterministic can be described and emulated.
In FTL2 it is possible to specify explicitly groups of
operations to be performed either sequentially or in parallel. These
specifications compile directly into special control nodes in the
augmented data-flow graph. Special functions are also provided
for performing sequential and parallel array operations.

2.1 The Syntax of FTL2

An FTL2 description is made up of a collection of forms. A form
consists of an open parenthesis, followed by zero or more elements
(which themselves may be forms) followed by a close parenthesis.
For example

(+ (element i a) (element j b))

is a form which contains two other forms, and whose value is

a(i) + b(j)

2.2 Data-Types

The basic data-types that FTL2 provides are the standard data
types of most programming languages: integer, floating point and
strings, and arrays of these types. The basic form of a variable
declaration is

(declare (<scope> (<type> {<name>})))

where <scope> is either local or global, <type> is integer, real or
string, <name> is any alphanumeric name, and the braces mean
that there can be zero or more occurrences of the syntactic unit
inside of them.

2.3 Modules

Modules are the main partitioning facility provided in FTL2.
The syntax for a module declaration is

(module <name>
    (declare (inputs {<inputs-names>})
             (outputs {<outputs-names>}))
    {<forms>}
)

Modules in FTL2 also provide a means for resource
management. When the declaration of a user-defined module is read, it
defines a prototype of that module and creates a single instance
with the same name. Whenever a module is invoked at run-time,
FTL2 checks to determine if that module is in use. If an attempt
is made to re-use a module which is already in use, an error is
reported to the user. Thus, timing errors which result in resource
conflict are detected and reported.
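
The in-use check can be sketched with a small class. Each module instance
carries a busy flag that is set on invocation and cleared on completion; a
second invocation while the flag is set raises an error. The class and
exception names are hypothetical, and the sketch uses nested calls in place
of simulated concurrent invocations, so it shows only the bookkeeping and not
the FTL2 timing model.

```python
class ResourceConflict(Exception):
    """Raised when a module instance is invoked while it is already in use."""


class Module:
    def __init__(self, name, body):
        self.name = name
        self.body = body      # callable taking (module, *args)
        self.in_use = False

    def invoke(self, *args):
        if self.in_use:
            raise ResourceConflict(f"module {self.name!r} is already in use")
        self.in_use = True
        try:
            return self.body(self, *args)
        finally:
            self.in_use = False  # release the single instance


adder = Module("adder", lambda module, a, b: a + b)
total = adder.invoke(2, 3)

# A body that tries to re-use its own instance triggers the conflict report.
busy = Module("busy", lambda module: module.invoke())
try:
    busy.invoke()
    conflict_seen = False
except ResourceConflict:
    conflict_seen = True
```

Because the check depends only on whether the instance is busy and not on the
data being processed, conflicts are found in a value-independent manner.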

2.4 Macros

FTL2 provides a macro facility which can be used when
copies of a module are needed, as well as in cases where
encapsulation without resource management is desired. Macros are
declared as

(macro <name> (<parameter1> {<parameter>})
    <forms>
)

3. Concurrency Modes

There are two concurrency modes in FTL2, lockstep and PC.
These modes provide different ways of dealing with time and
assignment. In lockstep mode, every primitive operation,
including assignment, is defined to take one unit of time. This mode has
the advantage of always providing fixed times for operations, but
has the disadvantage that balancing delays must be done by the
user.


The other mode is called "PC" mode and combines the
"Fetch-Store" style of assignment with a hierarchical
method of structuring time. In PC mode, each state is
divided into two phases. Computation is performed during the
first phase and registers are set to their new values during the
second one. As an example, consider the following fragment

(parallel
    (set a (+ a (* b c)))
    (set b a))

Under lockstep mode, the value of b at the end of the block is the
original value of a, call it old_a, and the value of a at the end of the
block will be old_a + old_a * c. The reason for this is that b is set
to the value of a by the second assignment statement before it is
fetched by the first one. Under PC mode, the value of b at the end
of the block is the original value of a, but the value of a is old_a +
old_b * c.
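
The two outcomes can be reproduced with a small sketch. It assumes, as the
explanation implies, that the fragment assigns a + b * c to a and a to b. PC
mode is modelled as a two-phase step: every right-hand side is evaluated on
the old state, then all stores are committed. The lockstep outcome is
reproduced by applying the store to b before the expression for a is
evaluated; in FTL2 this order results from the unit delays of the individual
operations, whereas here it is imposed by hand. All names are illustrative.

```python
def run_pc_mode(state, assignments):
    """Two-phase step: fetch and compute on the old state, then store."""
    new_values = {target: expr(state) for target, expr in assignments}
    updated = dict(state)
    updated.update(new_values)
    return updated


def run_in_order(state, assignments):
    """Apply assignments one after another; each sees the earlier stores."""
    current = dict(state)
    for target, expr in assignments:
        current[target] = expr(current)
    return current


fragment = [
    ("a", lambda s: s["a"] + s["b"] * s["c"]),  # a := a + b * c
    ("b", lambda s: s["a"]),                    # b := a
]

old = {"a": 1, "b": 2, "c": 3}
pc_result = run_pc_mode(old, fragment)
# Lockstep outcome described in the text: b is stored before it is fetched.
lockstep_like = run_in_order(old, [fragment[1], fragment[0]])
```

With old_a = 1, old_b = 2 and c = 3, the PC result has a = old_a + old_b * c
and b = old_a, while the lockstep-like result has a = old_a + old_a * c and
b = old_a.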

PC mode also provides a different model of time. The
statements in the main module of the user's description are defined
to execute one basic time unit apart. When necessary, time is
broken up into finer units. The semantics of a parallel block are that
all statements inside of it begin execution at the same time, and
that the block is exited when all statements inside it are finished.
The semantics of a serial block are that it executes the first
statement in the block, waits for that statement to complete, and then
executes the next statement in the block, and continues in this
way until all of the statements in the block have been executed.
To preserve these semantics when a serial block is nested inside a
parallel one, it is necessary that the inner serial block have a finer
granularity of time than the outer one. Because the number of
statements in the inner block is not known as the compiler is
examining the input, time is treated as a mixed-radix
floating-point number. Whenever a serial block occurs inside a parallel
one, a new digit (of unknown radix) is added onto the least
significant end of the state. Then, each element of the serial
block is assigned a time (state) which is one greater in that digit
position than the previous element. When hardware is actually
generated, the mixed-radix floating-point numbers are converted
to integer state numbers and the states are then re-coded using
an algorithm that generates an efficient state assignment
for a PLA-based controller.
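
The conversion from hierarchical time stamps to integer state numbers can be
sketched as follows. A time stamp is represented as a tuple of digits, and a
serial block nested in a parallel one appends a digit to the stamp of the
enclosing state. Sorting the tuples lexicographically and numbering them
consecutively gives integer states in execution order. The tuple
representation and the function name are assumptions; the subsequent
re-coding for the PLA controller is not modelled.

```python
def assign_integer_states(times):
    """Map hierarchical time stamps (tuples of digits) to consecutive integers.

    Tuples compare lexicographically, so (2, 1) falls between (2,) and (3,).
    """
    ordered = sorted(set(times))
    return {stamp: index for index, stamp in enumerate(ordered)}


# Main-module statements at times 1, 2 and 3; a serial block nested in the
# parallel block at time 2 adds one finer digit for each of its statements.
stamps = [(1,), (2,), (2, 1), (2, 2), (2, 3), (3,)]
states = assign_integer_states(stamps)
```

The radix of the added digit never has to be known in advance, because only
the relative order of the stamps is used when the integer numbers are
assigned.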

4.  Data-flow  Based  Synthesis 

The previous sections of this report have described the use
of explicit concurrency in FTL2. It is also possible to use implicit
concurrency and to let the FTL2 compiler determine where it is
possible to have operations go on in parallel. The FTL2 compiler
performs a data-flow analysis of the specification and determines
the execution ordering of statements based on
data-dependencies.

Because FTL2 combines explicit concurrency and data-flow
analysis it is able to detect errors which previous systems could
not detect. These errors fall into two categories. First, the
compiler may detect that two operations can go on in parallel in a
case where the user has specified that they must proceed
sequentially. Second, the compiler may detect that two operations are
dependent and must proceed sequentially even though the user
has specified that they are to go on in parallel. In the first case,
there are two possibilities. One is that the designer has not
recognized a possible optimization, and therefore may get a system
with less than the highest possible performance. The other is that
there is a constraint, perhaps critical, which the designer has left
out of the specification. In the second case, the extra dependency
and the serialization it requires may or may not be critical
depending on whether or not it occurs on the critical path.

5. Simulator Implementation

Emulation of the augmented data-flow graphs in FTL2 is
performed by passing messages between the nodes of the data-flow
graphs. The data-flow graph of an FTL2 specification is unusual in
that it is a tree, rather than a general cyclic graph. Sequential
behavior is provided by sequential evaluation of sub-trees and
looping is provided by repeated evaluation of sub-trees.

Concurrent message passing is simulated through the use of
a central message queue and function dispatcher. The queue
contains ordered (message, node) pairs. The main simulator loop
removes pairs from the queue and calls the specified message
handler with the given node as its argument. The side-effects of a
message handler are highly restricted. A message handler for a
node may only change the state of that node, the state of that
node's parent, or send messages to other nodes.
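
The main loop described above can be sketched with a queue of (message, node)
pairs and a table of handlers. Each handler changes only the state of its own
node and returns the messages it wants to send, which the loop appends to the
queue. The message names "evaluate" and "done", the Node class and the
increment used as a primitive operation are hypothetical choices made for this
illustration.

```python
from collections import deque


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.state = {}


def handle_evaluate(node):
    """Compute a value in the node, then notify the parent if there is one."""
    node.state["value"] = node.state.get("input", 0) + 1  # hypothetical primitive
    return [("done", node.parent)] if node.parent is not None else []


def handle_done(node):
    """Count how many children of this node have finished."""
    node.state["finished_children"] = node.state.get("finished_children", 0) + 1
    return []


def simulate(queue, handlers):
    """Remove pairs from the queue and dispatch each to its message handler."""
    log = []
    while queue:
        message, node = queue.popleft()
        queue.extend(handlers[message](node))  # handlers send new messages
        log.append((message, node.name))
    return log


root = Node("root")
left, right = Node("left", root), Node("right", root)
pending = deque([("evaluate", left), ("evaluate", right)])
log = simulate(pending, {"evaluate": handle_evaluate, "done": handle_done})
```

Keeping all pending work in one ordered queue makes the simulated
concurrency deterministic: the same specification always produces the same
sequence of handler calls.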

When a module is defined, an augmented data-flow graph is
created for it. When the module is invoked, the graph is checked
to determine if it is already in use. If it is, an error message is
printed which specifies where and when (in simulated time) the
conflict occurred.


6. Results and Conclusions

The interactive top-level for FTL2 and the software
augmented-data-flow machine have been implemented and the
system has been used to describe and evaluate several designs,
including the RISC microprocessor and a perfect-shuffle
network node chip. The FTL2 system is currently being used to
describe a special-purpose augmented data-flow machine for
performing iterated timing analysis.